LendingClub is a US peer-to-peer lending company headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform. In this project, I use a LendingClub dataset obtained from Kaggle; this is one of the projects I completed in the Python for Data Science and Machine Learning Bootcamp by Jose Portilla on Udemy.
Data Overview
Project Intro/Objective
Given historical data on loans, with information on whether or not the borrower defaulted (charge-off), can we build a model that can predict whether or not a borrower will pay back their loan? That way, when we get a new potential customer in the future, we can assess whether they are likely to pay back the loan. Keep classification metrics in mind when evaluating the performance of the model!
The "loan_status" column contains our label.
Project Library
- Numpy
- Pandas
- Matplotlib
- Seaborn
- scikit-learn (sklearn)
- TensorFlow
- Keras
- random
Starter Code
This starter code defines a feat_info() function that prints the description of a given column.
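A minimal sketch of what that helper might look like. In the Kaggle download, the column descriptions come from a companion file (commonly named lending_club_info.csv); the tiny inline table below is just a stand-in so the sketch runs anywhere:

```python
import pandas as pd

# Stand-in for the companion file that maps column names to descriptions;
# in the project this would be loaded with pd.read_csv("lending_club_info.csv").
data_info = pd.DataFrame(
    {"LoanStatNew": ["loan_amnt", "installment"],
     "Description": ["The listed amount of the loan applied for by the borrower.",
                     "The monthly payment owed by the borrower."]}
).set_index("LoanStatNew")

def feat_info(col_name):
    """Print the description of a dataframe column."""
    print(data_info.loc[col_name, "Description"])

feat_info("loan_amnt")
```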
Data and Setup
In this section, I want to show some basic information about the data, including the data frame info and head. First, we import the basic modules and read the lending_club_loan_two.csv file.
After that, we can see the data frame info and head.
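The setup above can be sketched as follows. The CSV path is an assumption, and a tiny synthetic frame stands in when the file is absent so the sketch runs anywhere:

```python
import numpy as np
import pandas as pd

# Load the dataset; fall back to a tiny synthetic frame if the file is missing.
try:
    df = pd.read_csv("lending_club_loan_two.csv")
except FileNotFoundError:
    df = pd.DataFrame({
        "loan_amnt": [10000.0, 8000.0, 15000.0],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    })

df.info()        # column dtypes and non-null counts
print(df.head()) # first five rows
```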
Exploratory Data Analysis
In this section, I want to know which variables are important, view summary statistics, and visualize the data.
loan_status Countplot
Because I will be attempting to predict loan_status, I create a countplot of its values. From the countplot below, we can see that the majority of borrowers fully paid their loans, so the classes are imbalanced.
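A sketch of the countplot step, with a tiny made-up frame in place of the full dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import seaborn as sns

# Tiny stand-in for the full LendingClub data frame.
df = pd.DataFrame({"loan_status": ["Fully Paid"] * 4 + ["Charged Off"] * 1})

sns.countplot(x="loan_status", data=df)
counts = df["loan_status"].value_counts()
print(counts)  # "Fully Paid" dominates, so the classes are imbalanced
```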
loan_status Histogram
Then, I create a histogram to see the distribution of the loan amount. From the histogram below, we can see that most loan amounts fall between $5,000 and $20,000.
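The histogram step might look like this; plain matplotlib is used here (the project may use seaborn's histplot instead), and the loan amounts are randomly generated stand-ins:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic loan amounts standing in for df["loan_amnt"].
rng = np.random.default_rng(42)
loan_amnt = pd.Series(rng.uniform(1000, 40000, size=500))

plt.figure(figsize=(10, 4))
plt.hist(loan_amnt, bins=40)
plt.xlabel("loan_amnt")
plt.ylabel("count")
```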
Variable HeatMap
After looking at the loan status counts and the histogram, I try to understand the data using correlations. First, I compute the correlations between the numeric variables in the data frame.
Then, I create a HeatMap using that correlated data frame.
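A sketch of the correlation-plus-heatmap step; the three columns and their values are made up, with installment derived from loan_amnt so the two are strongly correlated, as in the real data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
loan_amnt = rng.uniform(1000, 40000, 200)
df = pd.DataFrame({
    "loan_amnt": loan_amnt,
    "installment": loan_amnt / 36 + rng.normal(0, 20, 200),  # strongly correlated
    "int_rate": rng.uniform(5, 30, 200),
})

corr = df.corr()  # pairwise correlations of the numeric columns
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="viridis")
```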
From the HeatMap above, we can see that the most strongly correlated pair is loan_amnt and installment. We can check the descriptions of loan_amnt and installment using the starter code.
I create a scatter plot to understand the relationship between loan_amnt and installment. From the scatter plot below, we can see that loan_amnt and installment are strongly correlated, which makes sense, since the installment is calculated from the loan amount.
Then I create a boxplot to understand the relationship between loan_status and loan_amnt. From the boxplot below, we can see that loan_amnt has little effect on loan_status.
I calculate the summary statistics for the loan amount, grouped by the loan_status.
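The boxplot and grouped summary statistics can be sketched as below; the four-row frame is a made-up stand-in:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Charged Off"],
    "loan_amnt": [10000, 12000, 8000, 15000],
})

# Boxplot of loan_amnt per loan_status
sns.boxplot(x="loan_status", y="loan_amnt", data=df)

# Summary statistics of loan_amnt grouped by loan_status
summary = df.groupby("loan_status")["loan_amnt"].describe()
print(summary)
```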
Then I try to explore the grade and subgrades columns that LendingClub attributes to the loans.
Then I create a countplot to understand the relationship between loan_status and grade. From the countplot below, we can see that grades E, F, and G have more trouble paying back their loans, so we can explore the subgrades to confirm this.
From the countplot below, we can see that grades F and G have the most trouble paying back their loans. I then isolate grades F and G and create a new countplot.
Below is the countplot for grade F and G.
I create a new column called 'loan_repaid' which will contain a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".
Then I create a bar plot showing the correlation of the numeric features to the new loan_repaid column.
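The loan_repaid mapping and correlation bar plot can be sketched as follows, again with a tiny made-up frame:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "loan_amnt": [10000, 12000, 8000],
    "int_rate": [7.5, 22.0, 9.0],
})

# 1 for "Fully Paid", 0 for "Charged Off"
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of the numeric features with the new label, as a bar plot
(df.corr(numeric_only=True)["loan_repaid"]
   .drop("loan_repaid")
   .sort_values()
   .plot(kind="bar"))
```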
Data Preprocessing
In this section, I remove or fill any missing data, remove unnecessary or repetitive features, and convert categorical string features to dummy variables.
Missing Value
First, I want to know whether there are any missing values in the LendingClub dataset, and if so, what percentage of each column is missing.
From the data above, we can see that emp_title and emp_length are the columns with the most missing values, so we check the descriptions of these columns.
We get the definitions of emp_title and emp_length. We can check how many unique employment titles there are.
There are 173,105 unique employment titles. Realistically, that is far too many to convert into dummy variables, so we can drop emp_title.
Then, I create a countplot of the emp_length feature column. First, I count how many categories emp_length has and sort them in ascending order; after that, I create the countplot.
Then I create a countplot using loan_status as the hue. The countplot below still doesn't really tell us whether there is a strong relationship between employment length and being charged off; what we want is the percentage of charge-offs per category, essentially telling us what percent of people per employment category didn't pay back their loan. So, I create a barplot that shows the relationship between emp_length and loan_status.
First, I count the charged-off and fully paid loans per emp_length category. Then I calculate the percentage and create a barplot.
From the barplot below, we can see that there is no strong relationship between emp_length and loan_status, so we can drop emp_length from the data frame.
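The charge-off-rate calculation might look like this; the five rows are made up, and the division works because pandas aligns the two grouped counts by category:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")

df = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "10+ years", "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid",
                    "Fully Paid", "Fully Paid", "Charged Off"],
})

# Count charged-off and fully paid loans per employment-length category
co = df[df["loan_status"] == "Charged Off"].groupby("emp_length")["loan_status"].count()
fp = df[df["loan_status"] == "Fully Paid"].groupby("emp_length")["loan_status"].count()

# Fraction of charged-off loans per category
emp_co_rate = co / (co + fp)
emp_co_rate.plot(kind="bar")
print(emp_co_rate)
```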
We can check the data frame again for the remaining missing values. We can see missing values in title; we can compare this column with the purpose column to see whether it holds repeated information.
We can see that purpose and title contain the same information, so we can drop the title column.
After dealing with emp_title, emp_length, and title, we can move on to the other columns with missing values. The mort_acc column has many missing entries, so we have to deal with it.
I get the definition of mort_acc using the starter code, then count its values. I could fill the missing data with a linear model, with the mean/median of mort_acc, or simply drop the column. But first, I look at the correlation of mort_acc with the other columns.
It looks like the total_acc feature correlates most strongly with mort_acc, so I will use a fillna() approach: I group the dataframe by total_acc and calculate the mean mort_acc per total_acc entry, which gives the result below:
I will fill in the missing mort_acc values based on their total_acc value. If the mort_acc is missing, then we will fill in that missing value with the mean value corresponding to its total_acc value from the Series we created above. This involves using an .apply() method with two columns.
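The two-column .apply() described above might be sketched like this; the five-row frame is a made-up stand-in, and the int() cast on the lookup key is a small safety measure for the row-wise float conversion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_acc": [5, 5, 10, 10, 10],
    "mort_acc": [1.0, np.nan, 3.0, 5.0, np.nan],
})

# Mean mort_acc per total_acc value
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

def fill_mort_acc(total_acc, mort_acc):
    """If mort_acc is missing, fill it with the group mean for that total_acc."""
    if np.isnan(mort_acc):
        return total_acc_avg[int(total_acc)]
    return mort_acc

df["mort_acc"] = df.apply(
    lambda row: fill_mort_acc(row["total_acc"], row["mort_acc"]), axis=1
)
print(df)
```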
There are two more columns with missing values, but they account for less than 0.5% of the total data, so those entries can simply be dropped.
Categorical Variables and Dummy Variables
I am done working with the missing data. Now I just need to deal with the string values in the categorical columns. First, I find which columns hold string data, then I go through each of them to decide what to do.
term Feature
I have to convert this string to either a 36 or a 60 as an integer. I will use .apply() for the conversion.
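A sketch of that conversion, assuming the term values are strings such as " 36 months":

```python
import pandas as pd

df = pd.DataFrame({"term": [" 36 months", " 60 months", " 36 months"]})

# Grab the numeric part of strings like " 36 months"
df["term"] = df["term"].apply(lambda t: int(t.split()[0]))
print(df["term"].tolist())
```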
grade Feature
I already know that grade is encoded in sub_grade, so I can drop this column.
Then I convert sub_grade into dummy variables and concatenate the new columns to the original dataframe.
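The drop-then-dummies step can be sketched as below, with a two-row stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "B"],
    "sub_grade": ["A4", "B1"],
    "loan_amnt": [10000, 8000],
})

df = df.drop("grade", axis=1)  # grade is already implied by sub_grade

# One dummy column per sub_grade value (drop_first avoids redundancy)
subgrade_dummies = pd.get_dummies(df["sub_grade"], drop_first=True)
df = pd.concat([df.drop("sub_grade", axis=1), subgrade_dummies], axis=1)
print(df.columns.tolist())
```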
verification_status, application_type, initial_list_status, purpose Feature
I convert these columns: ['verification_status', 'application_type', 'initial_list_status', 'purpose'] into dummy variables and concatenate them with the original dataframe.
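A sketch of converting those four columns at once; the category values are illustrative stand-ins:

```python
import pandas as pd

df = pd.DataFrame({
    "verification_status": ["Verified", "Not Verified"],
    "application_type": ["INDIVIDUAL", "JOINT"],
    "initial_list_status": ["w", "f"],
    "purpose": ["credit_card", "debt_consolidation"],
    "loan_amnt": [10000, 8000],
})

cols = ["verification_status", "application_type",
        "initial_list_status", "purpose"]

# Dummy-encode all four columns in one call, then swap them in
dummies = pd.get_dummies(df[cols], drop_first=True)
df = pd.concat([df.drop(cols, axis=1), dummies], axis=1)
print(df.columns.tolist())
```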
home_ownership Feature
First, I review the value_counts for the home_ownership column.
I convert these to dummy variables but replace NONE and ANY with OTHER, so that I end up with just four categories: MORTGAGE, RENT, OWN, and OTHER. Then I concatenate them with the original dataframe.
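That replace-then-encode step might look like this:

```python
import pandas as pd

df = pd.DataFrame({"home_ownership":
                   ["MORTGAGE", "RENT", "OWN", "NONE", "ANY"]})

# Collapse the rare NONE/ANY values into OTHER before making dummies
df["home_ownership"] = df["home_ownership"].replace(["NONE", "ANY"], "OTHER")
dummies = pd.get_dummies(df["home_ownership"], drop_first=True)
df = pd.concat([df.drop("home_ownership", axis=1), dummies], axis=1)
print(df.columns.tolist())
```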
address Feature
I can use the zip code from the address column: I feature-engineer a new column called 'zip_code' that extracts the zip code from the address.
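A sketch of the extraction, assuming each address ends with a five-digit zip code (the two addresses here are made up in that format):

```python
import pandas as pd

df = pd.DataFrame({"address": [
    "0174 Michelle Gateway\r\nMendozaberg, OK 22690",
    "1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113",
]})

# The zip code is the last 5 characters of the address string
df["zip_code"] = df["address"].apply(lambda addr: addr[-5:])
print(df["zip_code"].tolist())
```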
issue_d Feature
I cannot use this feature: when using my model, I wouldn't know beforehand whether or not a loan would be issued, so in practice I wouldn't have an issue date. I have to drop this feature.
earliest_cr_line Feature
This appears to be a historical timestamp feature. I will extract the year from it using .apply(), convert it to a numeric feature, and store it in a new column called 'earliest_cr_year'. Then I will drop the earliest_cr_line feature.
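A sketch of that extraction, assuming the values are strings such as "Jun-1990":

```python
import pandas as pd

df = pd.DataFrame({"earliest_cr_line": ["Jun-1990", "Apr-2004"]})

# Keep just the year from strings like "Jun-1990"
df["earliest_cr_year"] = df["earliest_cr_line"].apply(lambda d: int(d[-4:]))
df = df.drop("earliest_cr_line", axis=1)
print(df["earliest_cr_year"].tolist())
```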
Model Preparation
Train Test Split
First, I import train_test_split from sklearn. Then I drop loan_status, since it duplicates the loan_repaid column, and set the X and y variables to the .values of the features and the label.
After that, I perform train test split with test_size=0.2 and a random_state of 101.
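The split can be sketched as below; a random stand-in replaces the real feature matrix and label vector:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and label vector y
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# 80/20 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
print(X_train.shape, X_test.shape)
```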
Normalizing the Data
I will use a MinMaxScaler to normalize the feature data X_train and X_test. I fit only on the X_train data, because I don't want data leakage from the test set.
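A sketch of the fit-on-train-only pattern, again with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 5) * 1000
y = np.random.randint(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse training statistics: no leakage
```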
Creating the Model
Load TensorFlow and Keras
I will use tensorflow.keras.models to import Sequential and tensorflow.keras.layers to import Dense and Dropout.
For background, TensorFlow is an open-source end-to-end machine learning platform, while Keras is a high-level neural network library that runs on top of TensorFlow. Both provide high-level APIs for easily building and training models, but Keras is more user-friendly thanks to its simple, Pythonic interface.
Before setting up the model, I have to know the shape of X_train. Using the code above, I know that X_train has 78 features. I can set up a model that goes 78 → 39 → 19 → 1 output neuron. Because it has multiple hidden layers, we can call this a deep learning model.
I will use the Keras Sequential model, which is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
I will use Keras Dense layers. A Dense layer is fully connected: each of its neurons receives input from all neurons of the previous layer.
For the activation function on the input and hidden layers, I will use the rectified linear unit (ReLU). ReLU is a piecewise linear function that outputs the input directly if it is positive and zero otherwise.
For the activation function on the output layer, I will use the sigmoid. The sigmoid outputs values between 0 and 1, which can be thresholded to give a final prediction of 0 or 1.
I will add dropout layers with a rate of 0.2 to guard against overfitting.
Because we want a 0-or-1 output, I will use binary_crossentropy as the loss.
I will use the Adam optimizer because it is a robust, widely used default.
In model.fit, I will fit the model to the training data for at least 25 epochs and add in the validation data for later plotting.
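The architecture described above can be sketched as follows; the exact layer widths, dropout rate, and (commented-out) fit/save calls follow the 78 → 39 → 19 → 1 plan, while batch size and file name are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

# 78 input features -> 39 -> 19 -> 1, with dropout between layers
model = Sequential([
    Input(shape=(78,)),
    Dense(78, activation="relu"),
    Dropout(0.2),
    Dense(39, activation="relu"),
    Dropout(0.2),
    Dense(19, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # probability that the loan is repaid
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# In the project (X_train/X_test come from the earlier preprocessing):
# model.fit(x=X_train, y=y_train, epochs=25,
#           validation_data=(X_test, y_test), batch_size=256)
# model.save("full_data_project_model.h5")  # hypothetical file name
```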
Once model.fit has finished, I can save the model.
Evaluating the Model
Validation Loss versus Training Loss
To see how well my model performs, I plot the validation loss and the training loss and check whether there is a large gap between the two. If not, the model is good enough to use.
From the plot above, we can see that the training loss is not much different from the validation loss. I can say that the model is good enough, but there is still room for improvement.
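The loss plot is typically built from the history that model.fit returns; the per-epoch numbers below are hypothetical stand-ins:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")

# After training, model.history.history holds the per-epoch losses;
# these values are made up for illustration.
history = {"loss": [0.30, 0.27, 0.26], "val_loss": [0.29, 0.27, 0.265]}
losses = pd.DataFrame(history)
losses[["loss", "val_loss"]].plot()
```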
Classification Report and Confusion Matrix
To create a classification report and a confusion matrix, I import them from sklearn. Then I create a predictions variable from X_test and print it to see the result.
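A sketch of the evaluation step; the labels and predicted probabilities are hypothetical, and the 0.5 threshold on the sigmoid output is an assumption:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical hold-out labels and network outputs; in the project these
# come from y_test and the trained model's predictions on X_test.
y_test = np.array([1, 1, 1, 0, 0, 1])
pred_probs = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7])
predictions = (pred_probs > 0.5).astype(int)  # threshold the sigmoid output

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
```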
From the classification report, I get good results when predicting the fully paid borrowers but have problems predicting the charged-off borrowers, where the recall and f1-score are not good enough. I have to explore the charged-off class further.
From the confusion matrix, most fully paid borrowers are classified correctly, while a large share of charged-off borrowers are misclassified, which is consistent with the classification report.
From the classification report and confusion matrix, I conclude that I should explore the charged-off class further and try to address the class imbalance to get more convincing results.
Additional Resources
- Header Backgrounds by Jason Leung at unsplash.com
- Kaggle Original Challenge Source
- For further explanation of the Python code, please check this link.